Extensions of the UNIX File Command and Magic File for File Type Identification

Author

  • William Underwood
Abstract

File format identification is a core requirement for digital archives. The UNIX file command is among the most promising technologies for file type identification. This report describes extensions to the file command and magic file that enhance their utility for file format identification in archival systems. A File Format Library (a database) has been created to manage information about file formats. This information includes the file format name, MIME type, PRONOM Unique Identifier and file signature tests. There is a one-to-one correspondence between file formats and file signature tests. Precedence relations between file signature tests are explicitly expressed in the database. Published specifications for file formats are also collected in the Library and are used to determine file signatures for the formats. When specifications have not been published for a file format, sample files in that format have been collected and analyzed to determine possible file signatures. File signature tests have been created for more than 800 file formats. Sample files for more than 500 of the file formats in the Library have been created or collected for testing the file signatures. These examples are included in the Library. The Library also includes links to file format software resources that are needed in archival processing of digital records. These include file viewers/players, archive extractors, file format converters, password recovery software and repair tools for damaged files. The File Format Library supports the creation of a magic file from the file signature tests in the Library. The GTRI File Type Identifier is a graphical user interface to the file command and the magic file created from the File Format Library. The file command and magic tests have been applied to examples of more than 500 file formats from the File Format Library. These tests have led to refinement of the file signature tests and to the discovery of precedence relationships among them.
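
The role of precedence relations among signature tests can be illustrated with a minimal Python sketch. It is not the report's File Format Library or the file command's implementation; the `SignatureTest` record, the `precedes` field, and the helper functions are hypothetical names chosen for illustration, and the sample bytes are synthetic. It assumes that each format has exactly one test (as the abstract states) and that a more specific test (here an OpenDocument Text test inside a ZIP container) must run before the more general one.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SignatureTest:
    """One hypothetical library record: a format and its single signature test."""
    name: str
    mime: str
    offset: int          # byte offset where the magic bytes are expected
    magic: bytes         # literal bytes to compare at that offset
    precedes: set = field(default_factory=set)  # names of tests that must run later


def order_tests(tests):
    """Order tests so each one runs before every test it precedes (assumes no cycles)."""
    by_name = {t.name: t for t in tests}
    seen, postorder = set(), []

    def visit(test):
        if test.name in seen:
            return
        seen.add(test.name)
        for later in sorted(test.precedes):
            visit(by_name[later])
        postorder.append(test)  # appended only after everything it precedes

    for t in tests:
        visit(t)
    return postorder[::-1]      # reversed post-order: a test comes before its successors


def identify(data: bytes, tests) -> Optional[SignatureTest]:
    """Return the first test, in precedence order, whose magic bytes match."""
    for test in order_tests(tests):
        end = test.offset + len(test.magic)
        if data[test.offset:end] == test.magic:
            return test
    return None


# Illustrative records: the ODT test is more specific than the ZIP test,
# so it is declared to precede it.
ZIP = SignatureTest("ZIP", "application/zip", 0, b"PK\x03\x04")
ODT = SignatureTest(
    "OpenDocument Text",
    "application/vnd.oasis.opendocument.text",
    30,
    b"mimetype" + b"application/vnd.oasis.opendocument.text",
    precedes={"ZIP"},
)
LIBRARY = [ZIP, ODT]

# Synthetic byte strings standing in for real sample files.
odt_like = b"PK\x03\x04" + b"\x00" * 26 + ODT.magic
zip_like = b"PK\x03\x04" + b"\x00" * 26 + b"readme.txt"

print(identify(odt_like, LIBRARY).name)
print(identify(zip_like, LIBRARY).name)
```

Without the precedence relation, the general ZIP test could match first and hide the more specific format, which is the kind of conflict that explicit precedence is meant to resolve.
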
The National Archives (TNA) of the UK provides a public registry of file format information (PRONOM). This information includes file signature patterns expressed as regular expressions. TNA also provides a tool (DROID) that uses these file signature patterns for file format identification. This approach to file type identification is also promising and seems to be primarily limited by the small number of file signature patterns in the PRONOM registry. GTRI is collaborating with TNA to enhance the content of the registry and the performance of the DROID file format identifier.
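
Pattern-based identification of this kind can be sketched in a few lines of Python. This is an illustration only: Python's `re` syntax stands in for whatever notation the PRONOM registry uses, the `SIGNATURES` table and `matching_formats` function are hypothetical, and the three patterns are well-known header signatures rather than entries copied from the registry.

```python
import re

# Hypothetical signature table: format name -> byte pattern anchored at file start.
SIGNATURES = {
    "PDF": re.compile(rb"\A%PDF-\d\.\d"),
    "PNG": re.compile(rb"\A\x89PNG\r\n\x1a\n"),
    "GIF": re.compile(rb"\AGIF8[79]a"),
}


def matching_formats(data: bytes):
    """Return the names of all formats whose signature pattern matches the data."""
    return [name for name, pattern in SIGNATURES.items() if pattern.search(data)]


print(matching_formats(b"%PDF-1.4\n%..."))
print(matching_formats(b"plain text with no known signature"))
```

A table like this can only recognise the formats it lists, which mirrors the abstract's point that coverage depends on the number of signature patterns available.
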

Similar resources

Extensions and Applications of the SDIF Sound Description Interchange Format

This paper concentrates on recent extensions and applications of the well established SDIF Sound Description Interchange Format: SDIF selection is a standard way to access a part of an SDIF file. It allows users to select SDIF frames of a certain type and time, and certain parts of the matrix data. It can be appended to a file name, making it easy to use for command-line programs. Other applica...

Ex Vivo Comparison of File Fracture and File Deformation in Canals with Moderate Curvature: Neolix Rotary System versus Manual K-files

Background and Aim: Cleaning and shaping is one of the important steps in endodontic treatment, which has an important role in root canal treatment outcome. This study evaluated the rate of file fracture and file deformation in Neolix rotary system and K-files in shaping of the mesiobuccal canal of maxillary first molars with moderate curvature. Materials and Methods: In this ex vivo exp...

rat: A Secure Archiving Program With Fast Retrieval

A new archive format called rat was developed. This format was designed to allow very fast retrieval of individual files. This is achieved using a table of contents to quickly find the file. Each file in the archive is individually compressed with a compression method specific to the file. A user created configuration file is used to specify what type of compression to use on each file based on...

Content-based File-type Identification Using Cosine Similarity and a Divide-and-Conquer Approach

Identifying the file type (TXT, EXE, JPEG, etc.) is important for computer security applications such as computer forensics, steganalysis, and antivirus programs. The common approach for this is to use file extensions, magic numbers, or other header information. However, these are susceptible to tampering or corruption; for instance, the file extension can be easily spoofed and the magic number...

Content Based File Type Detection Algorithms

Identifying the true type of a computer file can be a difficult problem. Previous methods of file type recognition include fixed file extensions, fixed “magic numbers” stored with the files, and proprietary descriptive file wrappers. All of these methods have significant limitations. This paper proposes algorithms for automatically generating “fingerprints” of file types based on a set of known...

Publication date: 2009